In [1]:
'''
%%bash
ipython nbconvert --to latex --template citations.tplx Copula_Intro.ipynb
pdflatex Copula_Intro.tex
bibtex Copula_Intro_Eq
pdflatex Copula_Intro.tex
'''

An introduction to the concept of Copula in statistical simulations

As someone with a fair understanding of basic stochastic processes but by no means a statistician, I had come across the term copula in statistical simulations without ever really understanding what it is. Like many others, I turned to Wikipedia for a quick answer and at least the intuition behind it, but the definition there was not really convincing, at least to me. Now, after spending some time reading a few blog posts and the introduction included in the documentation of MATLAB's Statistics and Machine Learning Toolbox, Wikipedia's definition makes sense too, but it was not straightforward at the beginning.

To the extent that I understand it, the story begins with the simulation of a multivariate distribution. The most obvious case is the multivariate normal distribution; to make it even simpler, let us start with the bivariate normal distribution. If we have the mean and the covariance matrix, we know everything about such a distribution. Statisticians call such a set (the mean and covariance matrix in this case) "sufficient statistics".

We can simply generate random samples of a bivariate distribution with a given mean and covariance matrix.

In [2]:
%matplotlib inline

import pandas as pd
import numpy as np
import statsmodels.api as sm

import plotly
import plotly.express as px

from scipy.stats import t, norm, gamma, kendalltau
from math import inf

#from scipy.stats import multivariate_t

from scipy.io import loadmat
import scipy as sc
import matplotlib.pyplot as plt
In [3]:
plotly.offline.init_notebook_mode()
In [4]:
np.random.seed(seed=4) # for reproducibility
mean = [0,0]
rho = 0.7
cov = [[1,rho],[rho,1]] # covariance matrix; rho is the correlation between the first and second variables in this bivariate distribution

norm1,norm2 = np.random.multivariate_normal(mean,cov,2000).T
In [5]:
norm_data = pd.concat([pd.DataFrame(norm1), pd.DataFrame(norm2)], axis=1)
norm_data.columns = ['Vrb1', 'Vrb2']
norm_data.corr()
Out[5]:
Vrb1 Vrb2
Vrb1 1.00000 0.69251
Vrb2 0.69251 1.00000
In [6]:
fig = px.scatter(norm_data, x = 'Vrb1', y='Vrb2', width=800, height=500, 
                 trendline='ols', trendline_color_override='cyan', marginal_x='histogram', marginal_y='histogram', 
                 title='Bivariate Normal distribution')
fig.show()
fig.write_html("./fig1.html")

However, the problem becomes more difficult when the distribution is arbitrary. In particular, the mean and covariance are no longer sufficient statistics for generating random samples from an arbitrary multivariate distribution. Here the concept of a copula comes in handy. I assume the reader of this notebook may have come across the "inversion method" for generating samples from a random variable. The point is that a cumulative distribution function (CDF) maps the real numbers to the interval [0, 1]. Since a continuous, strictly increasing CDF is monotonic and injective (one-to-one), it is invertible. Now, if we have a random generator that is uniform on the interval [0, 1] and apply the inverse of the CDF to the uniformly generated samples, the result will be a random variable with the distribution characterized by that CDF. Conversely, if we take any random variable and evaluate its CDF at the sampled values, the result will have a uniform distribution. It is quite simple to prove this mathematically.
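The inversion method mentioned above can be sketched in a few lines, using `scipy.stats.norm` purely as an example target distribution:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

u = rng.uniform(size=10_000)   # Uniform(0, 1) samples
x = norm.ppf(u)                # inverse CDF (ppf) -> standard normal samples

print(x.mean(), x.std())       # close to 0 and 1
```

The sample mean and standard deviation come out close to 0 and 1, as expected for a standard normal.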

Let $X_1$ be a random variable with a given CDF, and define $U_1 = CDF(X_1)$; then: $$ U_1 \sim \text{Uniform}(0, 1)$$

Proof:

$$ CDF_U(u) = P(U \leq u) = P(CDF(X) \leq u)$$

Since $CDF^{-1}$ is increasing, it can be applied to both sides of the inequality, so we have:

$$ P(CDF^{-1}(CDF(X)) \leq CDF^{-1}(u)) = P(X \leq CDF^{-1}(u))$$

By definition, this probability equals: $$ CDF(CDF^{-1}(u))=u$$ Thus $CDF_U(u) = u$, which by definition characterizes a uniform distribution. $ \blacksquare $
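This probability integral transform can also be checked numerically, e.g. with a Kolmogorov-Smirnov test against the uniform distribution (a quick sketch I added, not part of the original derivation):

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(1)
samples = rng.standard_normal(2000)

u = norm.cdf(samples)                # probability integral transform
stat, pvalue = kstest(u, 'uniform')  # compare against Uniform(0, 1)
```

A large p-value means uniformity is not rejected.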

This can be seen in the following figure:

In [7]:
unif1 = norm.cdf(norm1)
unif2 = norm.cdf(norm2)

cdf_data= pd.concat([pd.DataFrame(unif1), pd.DataFrame(unif2)], axis=1)
cdf_data.columns = ['CDF_X', 'CDF_Y']
fig = px.scatter(cdf_data, x = 'CDF_X', y='CDF_Y', width=800, height=500, 
                 trendline='ols', trendline_color_override='cyan', marginal_x='histogram',
                 marginal_y='histogram', title='Bivariate uniform')
fig.show()
fig.write_html("./fig2.html")
In [8]:
tau_norm, _ =kendalltau(norm1, norm2) # Kendall tau
print(tau_norm)

tau, _ =kendalltau(unif1, unif2) # Kendall tau
print(tau)
0.48617108554277133
0.48617108554277133

As proven above, the output of the $CDF$ is uniformly distributed, and this can be seen in the figure above. Another interesting observation is that the rank-based correlation (e.g. Kendall's $\tau$) between the normally distributed random variables is preserved between the uniformly distributed variables.

Now that we have some uniformly distributed variables with a certain correlation between them, if we apply the inverse of an arbitrary $CDF$ to each of the uniformly distributed variables, we will get a bivariate distribution whose marginals are the selected arbitrary distributions, and we will see that the rank-based correlation between the variates of the distribution is also preserved. This constitutes the basis of the concept of a copula. It can be generalized to multivariate distributions too.

In the following example, I apply the inverse CDFs of a gamma distribution and a t-distribution to simulate a bivariate distribution.

In [9]:
x1 = gamma.ppf(unif1, a=2)
x2 = t.ppf(unif2, df=5)

cdf_data= pd.concat([pd.DataFrame(x1), pd.DataFrame(x2)], axis=1)
cdf_data.columns = ['gamma_X1', 't_X2']
fig = px.scatter(cdf_data, x = 'gamma_X1', y='t_X2', width=800, height=500, 
                 trendline='ols', trendline_color_override='cyan', marginal_x='histogram',
                 marginal_y='histogram', title='Bivariate with given marginals')
fig.show()
fig.write_html("./fig3.html")
In [10]:
tau_margins, _ =kendalltau(x1, x2) # Kendall tau
print(tau_margins)
0.48617108554277133

Note here that the correlation is preserved in the new bivariate distribution.
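The same recipe generalizes beyond two dimensions. As a sketch (with an illustrative 3×3 correlation matrix and marginals that are not taken from the cells above), a three-variate Gaussian copula could look like:

```python
import numpy as np
from scipy.stats import norm, gamma, t, expon

rng = np.random.default_rng(4)

R = np.array([[1.0, 0.6, 0.3],   # illustrative correlation matrix
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])

z = rng.multivariate_normal(np.zeros(3), R, size=5000)
u = norm.cdf(z)                  # three correlated Uniform(0, 1) margins

x = np.column_stack([gamma.ppf(u[:, 0], a=2),   # gamma marginal
                     t.ppf(u[:, 1], df=5),      # Student-t marginal
                     expon.ppf(u[:, 2])])       # exponential marginal
```

Each column follows its chosen marginal, while the correlation structure comes from the Gaussian copula.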

Definition of a Copula

A copula is a function satisfying certain conditions; most importantly, it represents a multivariate distribution whose marginal distributions are all uniform. Having a copula function $C$ makes the simulation of a multivariate distribution simpler. This follows from Sklar's theorem, which states that for an N-dimensional distribution with continuous margins $F_1, F_2,..., F_N$, there exists a copula function for which:

$$F(x_1, x_2,..., x_N)=C(F_1(x_1), F_2(x_2),..., F_N(x_N))$$

Note here that $F_i(x_i)$ is a cumulative distribution function and has a uniform distribution. Similar to the relationship between the cumulative distribution function and the probability density function, where $f(x)=\frac{dF}{dx}$, the copula density is defined as the mixed partial derivative of the copula function: $$c(u_1, u_2,..., u_N)=\frac{\partial^N C(u_1, u_2,..., u_N)}{\partial u_1 \, \partial u_2 \cdots \partial u_N}$$

Then the probability density function of the multivariate distribution is: $$f(x_1, x_2,..., x_N)=c(F_1(x_1), F_2(x_2),..., F_N(x_N))\prod_{i=1}^{N} f_i(x_i)$$
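For the bivariate normal case this decomposition can be verified numerically: the Gaussian copula density is the joint normal pdf divided by the product of the standard normal marginal pdfs (a sketch I added; `rho` matches the value used earlier):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rho = 0.7
mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

x1, x2 = 0.5, -1.2
u1, u2 = norm.cdf(x1), norm.cdf(x2)          # F1(x1), F2(x2)

# Gaussian copula density c(u1, u2), evaluated via the inverse normal CDF
z1, z2 = norm.ppf(u1), norm.ppf(u2)
c = mvn.pdf([z1, z2]) / (norm.pdf(z1) * norm.pdf(z2))

f_joint = c * norm.pdf(x1) * norm.pdf(x2)    # f = c(F1, F2) * f1 * f2
```

`f_joint` agrees with the joint pdf `mvn.pdf([x1, x2])` up to floating-point error.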

Now imagine we have some realizations of two random variables, which I will call the original random variables, and we would like to simulate a bivariate distribution whose marginals follow the distributions of the original random variables, with a certain correlation.

In [11]:
stock_data = loadmat('./stockreturns')
print(stock_data.keys())
dict_keys(['__header__', '__version__', '__globals__', 'stocks'])
In [12]:
np.shape(stock_data["stocks"]) # let us imagine two of the columns in this array are realizations of our original random variables
Out[12]:
(100, 10)
In [13]:
invCDF1 = np.sort(stock_data["stocks"][:,1])  # sorted sample = empirical inverse CDF of the first variable
n1 = np.shape(stock_data["stocks"])[0]
invCDF2 = np.sort(stock_data["stocks"][:,2])  # sorted sample = empirical inverse CDF of the second variable
n2 = n1

mean = [0,0]
rho = 0.7
cov = [[1,rho],[rho,1]] # covariance matrix; rho is the desired correlation
norm1,norm2 = np.random.multivariate_normal(mean,cov,2000).T
unif1 = norm.cdf(norm1)
unif2 = norm.cdf(norm2)
In [14]:
ind1_int = np.floor(n1*unif1).astype(int)  # map each uniform to an index into the sorted sample
ind2_int = np.floor(n2*unif2).astype(int)
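As a design note, the floor-and-index lookup above is one way to evaluate an empirical inverse CDF; `np.quantile` offers an interpolating alternative (a sketch with made-up stand-in data, since it does not reuse the notebook's variables):

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.gamma(shape=2.0, size=100)   # stand-in for one observed column
u = rng.uniform(size=2000)              # uniforms from the Gaussian-copula step

x = np.quantile(data, u)                # empirical inverse CDF with interpolation
```

Unlike the index lookup, `np.quantile` interpolates between the sorted observations instead of repeating them.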
In [15]:
X1 = invCDF1[ind1_int]
X2 = invCDF2[ind2_int]

cdf_data= pd.concat([pd.DataFrame(X1), pd.DataFrame(X2)], axis=1)
cdf_data.columns = ['stock1', 'stock2']
fig = px.scatter(cdf_data, x = 'stock1', y='stock2', width=800, height=500, 
                 trendline='ols', trendline_color_override='cyan', marginal_x='histogram',
                 marginal_y='histogram', title='Bivariate with marginals based on observed realizations')
fig.show()
fig.write_html("./fig4.html")

Obviously, there is a lot more to the concept of copulas and to simulating various correlation structures, for example correlations that vary for extreme values; here I have only touched on what are known as elliptical copulas. Anyhow, this notebook drew on several other tutorials to put together a hopefully concise basic introduction to the concept.